This document provides a running example of completing the Week 3 assignment :
In [1]:
import pandas as pd
from pyspark.mllib.clustering import KMeans, KMeansModel
from numpy import array
First let us read the contents of the file ad-clicks.csv. The following commands read in the CSV file in a table format and removes any extra whitespaces. So, if the CSV contained ' userid ' it becomes 'userid'.
Note that you must change the path to ad-clicks.csv to the location on your machine, if you want to run this command on your machine.
In [2]:
adclicksDF = pd.read_csv('./ad-clicks.csv')
adclicksDF = adclicksDF.rename(columns=lambda x: x.strip()) #remove whitespaces from headers
Let us display the first 5 lines of adclicksDF:
In [3]:
adclicksDF.head(n=5)
Out[3]:
Next, We are going to add an extra column to the ad-clicks table and make it equal to 1. We do so to record the fact that each ROW is 1 ad-click. You will see how this will become useful when we sum up this column to find how many ads did a user click.
In [4]:
adclicksDF['adCount'] = 1
Let us display the first 5 lines of adclicksDF and see if a new column has been added:
In [5]:
adclicksDF.head(n=5)
Out[5]:
Next, let us read the contents of the file buy-clicks.csv. As before, the following commands read in the CSV file in a table format and removes any extra whitespaces. So, if the CSV contained ' userid ' it becomes 'userid'.
Note that you must change the path to buy-clicks.csv to the location on your machine, if you want to run this command on your machine.
In [6]:
buyclicksDF = pd.read_csv('./buy-clicks.csv')
buyclicksDF = buyclicksDF.rename(columns=lambda x: x.strip()) #removes whitespaces from headers
Let us display the first 5 lines of buyclicksDF:
In [7]:
buyclicksDF.head(n=5)
Out[7]:
For this exercise, we can choose from buyclicksDF, the 'price' of each app that a user purchases as an attribute that captures user's purchasing behavior. The following command selects 'userid' and 'price' and drops all other columns that we do not want to use at this stage.
In [8]:
userPurchases = buyclicksDF[['userId','price']] #select only userid and price
userPurchases.head(n=5)
Out[8]:
Similarly, from the adclicksDF, we will use the 'adCount' as an attribute that captures user's inclination to click on ads. The following command selects 'userid' and 'adCount' and drops all other columns that we do not want to use at this stage.
In [9]:
useradClicks = adclicksDF[['userId','adCount']]
In [10]:
useradClicks.head(n=5) #as we saw before, this line displays first five lines
Out[10]:
From each of these single ad-clicks per row, we can now generate total ad clicks per user. Let's pick a user with userid = 3. To find out how many ads this user has clicked overall, we have to find each row that contains userid = 3, and report the total number of such rows. The following commands sum the total number of ads per user and rename the columns to be called 'userid' and 'totalAdClicks'. Note that you may not need to aggregate (e.g. sum over many rows) if you choose a different feature and your data set already provides the necessary information. In the end, we want to get one row per user, if we are performing clustering over users.
In [11]:
adsPerUser = useradClicks.groupby('userId').sum()
adsPerUser = adsPerUser.reset_index()
adsPerUser.columns = ['userId', 'totalAdClicks'] #rename the columns
Let us display the first 5 lines of 'adsPerUser' to see if there is a column named 'totalAdClicks' containing total adclicks per user.
In [12]:
adsPerUser.head(n=5)
Out[12]:
Similar to what we did for adclicks, here we find out how much money in total did each user spend on buying in-app purchases. As an example, let's pick a user with userid = 9. To find out the total money spent by this user, we have to find each row that contains userid = 9, and report the sum of the column'price' of each product they purchased.
The following commands sum the total money spent by each user and rename the columns to be called 'userid' and 'revenue'.
Note: that you can also use other aggregates, such as sum of money spent on a specific ad category by a user or on a set of ad categories by each user, game clicks per hour by each user etc. You are free to use any mathematical operations on the fields provided in the CSV files when creating features.
In [13]:
revenuePerUser = userPurchases.groupby('userId').sum()
revenuePerUser = revenuePerUser.reset_index()
revenuePerUser.columns = ['userId', 'revenue'] #rename the columns
In [14]:
revenuePerUser.head(n=5)
Out[14]:
Lets see what we have so far. We have a table called revenuePerUser, where each row contains total money a user (with that 'userid') has spent. We also have another table called adsPerUser where each row contains total number of ads a user has clicked. We will use revenuePerUser and adsPerUser as features / attributes to capture our users' behavior.
Let us combine these two attributes (features) so that each row contains both attributes per user. Let's merge these two tables to get one single table we can use for K-Means clustering.
In [15]:
combinedDF = adsPerUser.merge(revenuePerUser, on='userId') #userId, adCount, price
Let us display the first 5 lines of the merged table. Note: Depending on what attributes you choose, you may not need to merge tables. You may get all your attributes from a single table.
In [16]:
combinedDF.head(n=5) #display how the merged table looks
Out[16]:
Our training data set is almost ready. At this stage we can remove the 'userid' from each row, since 'userid' is a computer generated random number assigned to each user. It does not capture any behavioral aspect of a user. One way to drop the 'userid', is to select the other two columns.
In [17]:
trainingDF = combinedDF[['totalAdClicks','revenue']]
trainingDF.head(n=5)
Out[17]:
Display the dimension of the training data set. To display the dimensions of the trainingDF, simply add .shape as a suffix and hit enter.
In [18]:
trainingDF.shape
Out[18]:
The following two commands convert the tables we created into a format that can be understood by the KMeans.train function.
line[0] refers to the first column. line[1] refers to the second column. If you have more than 2 columns in your training table, modify this command by adding line[2], line[3], line[4] ...
In [19]:
sqlContext = SQLContext(sc)
pDF = sqlContext.createDataFrame(trainingDF)
parsedData = pDF.rdd.map(lambda line: array([line[0], line[1]])) #totalAdClicks, revenue
Here we are creating two clusters as denoted in the second argument.
In [20]:
my_kmmodel = KMeans.train(parsedData, 2, maxIterations=10, runs=10, initializationMode="random")
In [21]:
print(my_kmmodel.centers)
Each array denotes the center for a cluster:
One Cluster is centered at ... array([ 29.43211679, 24.21021898])
Other Cluster is centered at ... array([ 42.05442177, 113.02040816])
First number (field1) in each array refers to number of ad-clicks and the second number (field2) is the revenue per user. Compare the 1st number of each cluster to see how differently users in each cluster behave when it comes to clicking ads. Compare the 2nd number of each cluster to see how differently users in each cluster behave when it comes to buying stuff.
In one cluster, in general, players click on ads much more often (~1.4 times) and spend more money (~4.7 times) on in-app purchases. Assuming that Eglence Inc. gets paid for showing ads and for hosting in-app purchase items, we can use this information to increase game's revenue by increasing the prices for ads we show to the frequent-clickers, and charge higher fees for hosting the in-app purchase items shown to the higher revenue generating buyers.
Note: This analysis requires you to compare the cluster centers and find any ‘significant’ differences in the corresponding feature values of the centers. The answer to this question will depend on the features you have chosen.
Some features help distinguish the clusters remarkably while others may not tell you much. At this point, if you don’t find clear distinguishing patterns, perhaps re-running the clustering model with different numbers of clusters and revising the features you picked would be a good idea.
In [ ]: